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Abstract 

^5 I n this paper we review recently developed methods for nonparametric 

Bayesian inference for one-dimensional diffusion models. We discuss dif- 

ferent possible prior distributions, computational issues, and asymptotic 

^-J results. 
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1 Introduction 

Stochastic Langevin models or diffusion models arise in many fields of applied 
science. Basically they describe the evolution of a system whose dynamics are 
governed by a "noisy" ordinary differential equation. If X t G R d denotes the 
p<| state of the system at time t, then the general form of such an equation is 

j>! < —t~ — b(t,X t ) + "Gaussian white noise". (1-1) 

• i— i ™ 

The function b describes the instantaneous drift of the process X. This drift 
is perturbed by random noise, with an intensity that can be time and state 
dependent in general. A more detailed, but still informal description and many 
examples of applications in the context of molecular modeling can for instance 
be found in Section 14.4 of Schlick (2010). 

As explained in Section 14.5 of Schlick (2010), we can view (1.1) as an 
informal description of the stochastic differential equation (SDE) 

dX t = b(t,X t )dt + a(t,X t )dW t . (1.2) 

Here W is a Brownian motion, which models the random noise, and the diffusion 
function a describes the impact of the current time and state on the level of 
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the noise (see the next section for more details). In this paper we restrict 
our attention to one-dimensional models, i.e. models for which the state X t 
is real-valued. Moreover, we consider only time-homogenous SDEs, meaning 
that b and a are only functions of the state. Such one-dimensional diffusion 
models are applied in many different fields in the biosciences. Often they arise 
after a reduction of a higher-dimensional system to dimension one, achieved for 
instance by suitable aggregation of data, or by a principle component analysis. 
Some concrete examples include models for the membrane potential in a neuron 
(e.g. Lansky ct al. (1990)), population size models (Fleming (1975)), decision 
making models (Roxin and Ledberg (2008)), reduced models for neurodynamical 
data (Deco et al. (2009)) and models in molecular dynamics (Papaspiliopoulos 
et al. (2012)). 

Fitting a diffusion model to observed data amounts to estimating the func- 
tions b and a. In certain cases it is reasonable to postulate a specific parametric 
form of these functions, i.e. to assume that they are known up to some finite- 
dimensional parameter. A well-known example is the mean-reverting Ornstein- 
Uhlenbeck process, for which a is constant and b(x) = a(x — /3) for some a < 
and /? S K. This is the classical Langevin equation, cf. Langcvin (1908). Fitting 
this model reduces to estimating the parameter 8 = (a, (3, a) G M + xMxIR + . See 
for instance Kutoyants (2004) and Kcsslcr ct al. (2012) for an overview of meth- 
ods. In other cases however, natural parametric specifications are not possible 
or undesirable and one has to resort to nonparametric methods for making infer- 
ence on the functions b and a. Several such methods have been proposed in the 
literature. An incomplete list include kernel methods (e.g. Banon (1978), Ku- 
toyants (2004), Van Zanten (2001)), penalized likelihood methods (e.g. Comte 
ct al. (2007)), and spectral approaches Bandi and Phillips (2003). 

The statistical techniques mentioned thus far are all "frequentist" in nature, 
i.e. non-Bayesian. In almost all branches of applied statistics however, the use 
of Bayesian methods has hugely increased in recent years. This is to a large 
extent due to the development of computational methods. Another appeal is 
that a Bayes procedure provides a natural way to quantify the uncertainty in 
the estimates, through the spread of the posterior distribution. Moreover, the 
construction of a prior distribution, as required in the Bayesian paradigm, can 
be a useful tool to bring structure to a complex statistical model. Initially 
the use of Bayes methods in nonparametric problems was met with skepticism, 
since it was pointed out early on that prior specification is a delicate matter in 
these cases (e.g. Freedman (1963, 1965)). However, mathematical and practical 
insights of the last decade have shown that these difficulties can be overcome. As 
a result Bayesian methods are now widely used in problems like nonparametric 
regression, density estimation and classification, in many different application 
areas (see for instance Hjort ct al. (2010) for an overview). 

The development and study of Bayesian methodology for SDEs started rel- 
atively late and initially focussed on parametric models. See for instance the 
papers Eraker (2001), Roberts and Stramer (2001), Beskos et al. (2006a), to 
mention but a few. Only a handful of papers about nonparametric Bayes meth- 
ods for SDEs are available at the present time. The first paper to propose a 
practical method is Papaspiliopoulos et al. (2012). The theoretical, asymptotic 
behavior of the procedure of Papaspiliopoulos et al. (2012) is studied in the pa- 
per Pokcrn ct al. (2012). Schauer et al. (2012) recently proposed an alternative 
computational approach. Other available papers deal with asymptotics in this 
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framework, cf. Van der Meulen et a!. (2006), Panzar and Van Zanten (2009) and 
Van der Meulen and Van Zanten (2012), Gugushvili and Sprcij (2012). 

In the present paper we review the recently developed nonparametric Bayes 
methods for SDEs. An interesting aspect of these methods is that they provide a 
natural way for dealing properly with low-frequency data. Moreover, they allow 
to report credible bands for uncertainty quantification in addition to an estima- 
tor of the function of interest. The few examples that exists at the moment show 
that the methods have become numerically feasible. As a result, nonparametric 
Bayes methods can be expected to become more and more common tools for 
fitting SDE models to observed data. 

Throughout the paper a balance is sought between mathematical rigor and 
a lucid presentation. This means that statements are sometimes loosely formu- 
lated, or that regularity conditions are taken for granted. We give references to 
the literature for readers seeking more mathematical detail. 

The remainder of the paper is organized as follows. In the next section 
we briefly treat some facts from the theory of stochastic differential equations, 
mainly to recall some terminology and fix notation. In Section 3 we discuss 
generalities about doing nonparametric Bayesian inference for diffusions. In 
particular, the differences between continuous and low-frequency data are out- 
lined. Recently proposed concrete priors are considered in Section 4. Section 
5 gives an overview of the available asymptotic theory. We end with some 
concluding remarks in Section 6. 

2 One-dimensional SDEs 

In this section we very briefly review some relevant theory of one-dimensional 
stochastic differential equations. 

2.1 Brownian motion 

The fundamental building block for stochastic differential equations is the Brow- 
nian motion process. Formally, a collection of random variables W = (Wt ■ t > 
0) defined on a common probability space is called a (standard) Brownian mo- 
tion if 

1. Wo = 0, 

2. for aU t > s > 0, Wt — W s is independent of (W u : u < s), 

3. for all t > s > 0, Wt — W s has a normal distribution with mean and 
variance t — s, 

4. almost every sample path 1 1— > Wt is continuous. 

Brownian motion plays a crucial role in stochastic process theory and has been 
and still is studied extensively. See for instance the books Revuz and Yor (1999) 
and Morters and Peres (2010) and the many references therein. 

It follows immediately from the definition that the Brownian motion W is 
a Gaussian process with mean ~EWt = and covariance EW S W^ = min{s,£} for 
all s,t > 0. Although it can proved that the ordinary derivative of Brownian 
does not exist, it can be thought of as the primitive of Gaussian white noise. As 
such it is instrumental in putting the loosely described "ODE with noise" (1.1) 
on a firm mathematical basis. 
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2.2 Stochastic differential equations 

Mathematically, the stochastic differential equation (1.2) is shorthand notation 
for the corresponding integral equation 

X t =X + [ b(s,X s )ds+ f o{s,X s )dW t . (2.1) 
Jo Jo 

Here the second integral is a stochastic integral, which has to be carefully de- 
fined and which obeys different calculus rules than ordinary integrals do. In 
particular, the main rule of calculus is replace by Ito's formula, which states 
that if X solves (1.2) and / is a twice continuously differentiable function, then 
f(X) satisfies the SDE 

df(X t ) = (f>(X t )b(t,X t ) + ^f(X t )<T 2 (t,X t ))dt + f'(X t )<j(t,X t )dW t . 

See any text on stochastic calculus for details (e.g. Chung and Williams (1990), 
Karatzas and Shreve (1991), 0ksendal (2003)). The stochastic integral in (2.1) 
is well approximated (in probability) by the so-called Riemann-Ito sum 

n 

^2<r((i - l)t/n,X { ^ 1)t/n )(W it/n - W (i _ 1)t / n ) 
i=i 

if n is large enough. 

It can be proved that under certain regularity conditions on the functions b 
and cr, for instance the classical Lipschitz and linear growth conditions, there 
exists a unique stochastic process X which solves the integral equation (2.1) 
(e.g. Karatzas and Shreve (1991)). Moreover, this solution is adapted to the 
Brownian motion W, in the sense that for every t > 0, X t only depends on 
(W s : s <t), i.e. it does not "look into the future". 

The notation (1.2) is reasonable since it correctly describes the infinitesimal 
behavior of the solution X of (2.1) in the sense that loosely speaking we have 
for very small h > that 

X t+h nX t + b(t, X t )h + a(t, X t )(W t+h - W t ). 

By the properties of the Brownian motion and the fact that X is adapted, we 
thus have that in distribution, 

X t+h « X t + b(t, X t )h + Vha(t, X t )Z, 

where Z is a standard normal random variable independent of X t . This gives a 
basic method for recursively simulating solutions to SDEs, the so-called Eulcr 
scheme. See for instance Kloeden and Platen (1992) for (much) more on this 
topic. 

2.3 Girsanov's theorem 

Below we will restrict our attention to the solution of a stochastic differential 
equation of the form 

dX t = b{X t ) dt + dW t , X a = x . 
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For T > 0, the restriction X — (X t : t € [0,T]) of this process to the interval 
[0, T] is a random element of the space C[0, T] of continuous functions on [0, T]. 
As such it generates a distribution, or law, PjT on this function space, defined 
by ff(B) = V(X € B) for (Borel) subsets of C[0,T\. 

Girsanov's theorem implies that the distribution has a density relative 
to the distribution Pq of the Brownian motion W = (Wt ■ t € [0, T]). Moreover, 
we have an expression for the corresponding Radon-Nikodym derivative, or, in 
statistical terminology, the likelihood. We have 

^W^exp^-^ b 2 (X t )dt + j^ b(X t )dX t ) 

(see for instance Liptser and Shiryaev (2001)). 

3 Bayesian inference for SDEs 

Suppose we observe a stochastic process X which solves a one-dimensional SDE 
of the form 

dX t = b{X t )dt + a(X t )dW u 

where W is a Brownian motion and b and a are functions satisfying certain 
regularity conditions ensuring at least that the SDE has a unique (weak) solution 
for every initial condition. Say we observe the process up till some time T > 0. 

Estimating the diffusion function a 2 is a degenerate statistical problem, at 
least if the data are recorded continuously, in the sense that its restriction to 
the range of the data can be recovered without error. We can use the fact that 
the quadratic variation of the process X is given by 

(X) t = [ <r 2 (X s )d S . 
Jo 

The left-hand side of this equation is a measurable function of the data, we have 
for instance that as n — > oo, 

n 

/,( X it/n - ^(i-l)t/«) 2 -> (X) t 

8=1 

in probability, for every t > (e.g. Jacod and Shiryaev (2003)). Assuming 
therefore that a is known we can instead of the original process X consider the 
transformed process 

/ — -r- dx, t > 0. 
Jo v(x) 

By Ito's formula this process has unit diffusion and the statistical problem re- 
duces to making inference about its drift function. 

In view of these observations we will assume throughout the remainder of 
the paper that the SDE under consideration has unit diffusion and focus on 
estimating the drift. In the case of low- frequency data, the transformation 
outlined above can not be carried out however, and the extension of the methods 
we discuss ahead to the case of an unknown diffusion function is not always 
straightforward. We refer to Roberts and Stramer (2001) for an approach that 
allows a parametric description of the diffusion function. 
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3.1 Continuous-time observations 



Suppose that for T > 0, we observe the unique solution X = (X t : t £ [0, T]) of 
the SDE 

dX t = b(X t ) dt + dW t , X Q = x , (3.1) 

where W is a Brownian motion and b : R — ¥ M. is an unknown, continuous drift 
function. By Girsanov's theorem, the law that the process X generates on the 
space C[0, T] of continuous functions on [0, T] is equivalent to the Wiener mea- 
sure on the space (with the appropriate initial condition) and the corresponding 
likelihood satisfies 

p(X\b) = eiq>(-^j b 2 (X t )dt + J b{X t )dX^ (3.2) 

(see Section 2.3). 

The nonparametric Bayesian approach now consists in putting a prior dis- 
tribution II on the "parameter" b and computing the corresponding posterior 
distribution. Formally the prior II can be any probability measure on the space 
C(R) of continuous functions on K or on a suitable subspace, for instance a 
space of functions with a certain regularity, and/or certain periodicities. The 
posterior distribution of b, which we denote by II(- \ X), is then given by the 
usual Bayes formula 



U(b£ B\X) = 



J B p(b\X)Tl(db) 

j P (b\x)u(db) ■ 



(In concrete situations it has to be verified that the integrands are properly 
measurable, so that the integrals are well defined.) 

The integrals in the expression for the posterior are over infinite-dimensional 
spaces, which makes it challenging to do computations. We will see in Section 4 
ahead however that various sensible choices for the prior allow the construction 
of feasible algorithms for drawing realizations from the posterior. 

In general the drift b in (3.1) is a function defined on the whole real line. 
This makes it not completely obvious to come up with reasonable priors, as most 
priors available in the literature are defined on spaces of compactly supported 
functions. However, we can usually work with such more common priors in the 
SDE case as well. In certain applications it is natural to assume that 6 is a 
periodic function, reducing the problem to estimating a function on [0, 2n], or 
another finite interval. This is for instance the case if the available data consist 
of recordings of angles (e.g. Pokern (2007), Hindriks (2011) or Papaspiliopoulos 
et al. (2012)). But also if periodicity can not be assumed it is typically only 
sensible to estimate the function 6 on a compact interval JcM, since far away 
from the range of the data there is simply no information available about the 
function. Now note that if we define, for a set Scl, 



bs(x) 



b(x) if x £ S, 
else, 



then, with I c denoting the complement of the interval /, the likelihood factorizes 
as p(X | b) = p(X | bj)p(X \ &j<=). It follows that if we put a prior on b by putting 
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independent priors 11/ and IT/c on bj and 6/c, respectively, then the marginal 
posterior for 6/ does not depend on the prior Il/c and is given by 



11(6, €B\X) = 



j- B p(b I \X)IV I {db) 
$p{b I \X)n I (db) ' 



Hence in this case as well, we only need to specify a prior on a compactly 
supported function. 

In the examples in Section 4 we shall consider the periodic case, but simple 
modifications allow to deal with the non-periodic case as well. 



3.2 Low-frequency observations 

In the preceding subsection we have been dealing with continuously observed 
diffusions. Obviously, the phrase "continuous data" should be interpreted prop- 
erly. In practice it means that the frequency at which the diffusion is observed 
is so high that the error that is incurred by approximating the stochastic and 
ordinary integrals like the ones appearing in the likelihood (3.2) by the cor- 
responding Riemann or Riemann-Ito sums, is negligible. If we only have low- 
frequency, discrete-time observations at our disposal, these approximation errors 
can typically not be ignored however and can introduce undesired biases. 

Assume now that we only have partial observations Xq, JTa, ■ ■ ■ , X n A of the 
solution of (3.1), for some A > and n € N. We set T — iiA. Under mild 
regularity conditions the discrete observations constitute a Markov chain, but 
it is well known that the transition densities of discretely observed diffusions 
and hence the likelihood is not available in closed form in general. This com- 
plicates a Bayesian analysis. An approach that has been proven to be very 
fruitful is to view the continuous diffusion segments between the observations 
as missing data and to treat them as latent (function-valued) variables. Since 
the continuous-data likelihood is known (cf. the preceding subsection) , data aug- 
mentation methods (see Tanner and Wong (1987)) can be used to circumvent 
the unavailability of the likelihood in this manner. 

Concretely, let us again denote the full, continuous-time process up till time 
T by X — (X t : t £ [0, T]). Asume that we have solved the inference problem 
for the continuous data problem described in the preceding subsection, in the 
sense that we have an algorithm that generates (approximate) draws from the 
posterior distribution II(- 1 b) of & given the full data X. The data augmentation 
method relies on the fact that in the present situation it is also possible to 
generate draws from the conditional distribution 

X | b, Xq , A a , ■ • ■ , X n A 

of the full process X given the discrete-time observations X , X&, . . . , X n & (de- 
tails ahead). Approximate draws from the target posterior distribution, i.e. the 
distribution of the drift given Ao,Xa, ■ ■ ■ , A„a, can then be obtained from a 
Gibbs sampler which is initialized at some function b and then repeats the steps 

1. draw X \ b, X , X A , . . . , X nA , 

2. draw b \ X 

a large number of times. 
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By the Markov property of the diffusion, step 1. in the Gibbs sampler can 
be done by independently drawing the n missing segments 

(X t : t G ((t - 1)A, iA)) | b, X (i _ 1)A , X lA (3.3) 

for i = 1, . . . , n, and gluing them together to obtain the full path X. The crucial 
observation is that the diffusion bridge law (3.3) is equivalent to a Brownian 
bridge that starts in X(j_i)A at time (i — 1)A and ends up in X$a at time 
iA. Moreover, by Girsanov's theorem again, the corresponding Radon-Nikodym 
derivative is proportional to 

exp ( f b(X t ) dX t - \ [ b 2 (X t ) dt] . 

W(i-l)A Z J{i-l)A ' 

Since it is straightforward to simulate Brownian bridges, this makes it possible 
to simulate diffusion bridges using for instance rejection sampling or Metropolis- 
Hastings techniques. Exact simulation methods for diffusion bridges have been 
proposed in the literature as well, see for instance Beskos et al. (2006a) , Bcskos 
et al. (2006b). For the present purposes exact simulation is not strictly necessary 
however and it is usually more convenient to add a Metropolis-Hastings (MH) 
step corresponding to a Markov chain that has the diffusion bridge law given by 
(3.3) as stationary distribution. For more details on this type of MH samplers 
for diffusion bridges we refer to Roberts and Stramer (2001). 



4 Gaussian and conditionally Gaussian priors 

For successful Bayesian inference for SDEs it is obviously important that a prior 
is used that makes the procedure computationally feasible. Moreover, to avoid 
inconsistency problems, we should aim at using priors with "large support" , in 
the sense that they do not exclude too many drift functions. In this section we 
review recently proposed options. 



4.1 Finite-dimensional priors 

A first, perhaps naive approach to constructing a prior on the drift function b is 
to choose a finite set of basis functions ipi, . . . , ip m , assume that the drift admits 
an expansion b = Y)j—x Cjipj and to put a prior distribution on the vector of 
coefficients c = (c\, . . . , c m ). In terms of c the likelihood can then be written as 

where the data enters through the vector [i and matrix £ with components 

fij= [ ^(X t )dX t , E«= / A(Xt)^j(X t )dt, 
Jo Jo 

for i,j = 1, . . . , m. Hence if the prior on the coefficients c has a Lebesgue density 
7r, then the posterior distribution of c has a density proportional to 

c4i(c)e cV ^ cT& . 
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Given draws from this posterior for c we obtain draws from the posterior for b 
by combining the coefficients with the basis functions. 

Since the likelihood is log-quadratic, it is convenient to choose a Gaussian 
prior on c. It is straightforward to verify that if the prior on c is N m (0, A), i.e. an 
m-dimensional normal distribution with mean and covariance matrix A, then 
the posterior for c is Gaussian as well, namely 7V m ((S-|-A _1 ) _1 /L(, (S+A -1 ) -1 ). 
Sampling from this posterior distribution is straightforward in principle, al- 
though the necessary matrix inversions can become numerically demanding as 
to gets large. It can be advantageous to employ basis functions leading to a 
sparse matrix S, in order to speed up the matrix computations. 

The sketched procedure can work quite well, but only if the drift is in actual 
fact (close to) a linear combination of the chosen basis functions. When using a 
prior on a space of functions with a fixed, finite dimension, only the projection 
of the true drift on this space can be recovered. This can look quite different 
than the actual drift. Figure 1 illustrates this point. Here we simulated data 
from the SDE (3.1) with drift function b(x) = -(l/2)a;(a;-l)(^ + l). We defined 
a prior on b by dividing the interval [—2,2] into 20 subintervals of equal length 
and writing b as a linear combination of indicator functions of these intervals, 
with independent Gaussian coefficients. The lower left-hand panel of Figure 1 
visualizes the corresponding posterior. The solid blue line is the posterior mean 
and the dashed lines describe .95 point wise credible intervals. Clearly, this 
posterior only gives a very crude picture of the true drift (which is the solid 
black curve). We note that the credible bands are very wide near —2 and 2, 
since only very limited data fall into that region, cf. the histogram of the data 
in the lower right-hand panel of the figure. 

In the next two subsections we describe two recently proposed procedures 
that avoid this problem by employing truly infinite-dimensional prior distribu- 
tions for the drift, both with large support. 

4.2 Gaussian priors with differential precision operators 

Let us assume that the drift function b in (3.1) is continuously differentiable, 
1-periodic and zero-mean, in the sense that J b{x) dx = 0. For this situation 
Papaspiliopoulos et al. (2012) propose a centered Gaussian prior on the space 
L 2 [0, 1] of square integrable functions on the unit interval. In general, a centered 
Gaussian measure II on L 2 [0,1] is determined by its covariance operator A : 
L 2 [0, 1] — > L 2 [0, 1] which has the property that 

J [J g{x)f{x)dx)( K J^ h(x)f(x)dx)lL(4f)= J g(x)(Af)(x)dx 

for all g,h 6 L 2 [0, 1]. A linear operator on L 2 [0, 1] is a covariance operator of 
a Gaussian measure if and only if it is positive definite, symmetric, and trace- 
class (e.g. Bogachev (1998)). The covariance operator of Papaspiliopoulos et al. 
(2012) is defined through its inverse, the so-called precision operator. Given 
fixed hyperparameters 77, k > and p € {2,3, . . .} this inverse is the densely 
defined operator 

\- 1 = v ((-Ay + Ki), (4.1) 
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Figure 1: Top panel: simulated data. Lower left: true drift (black), posterior 
mean (solid blue) and .95 pointwise credible intervals (dashed blue). Lower 
right: histogram of the data. 



where A the one-dimensional Laplacian, i.e. A/ = /" and / is the identity 
operator. The domain of A -1 is the space of periodic, zero- mean functions 
with Sobolev regularity 2p. It can be shown that this indeed defines a proper 
Gaussian prior II on L 2 [0, 1]. The hyper parameter p determines the regularity 
of the prior in some sense. As shown in Pokern et al. (2012), the prior gives mass 
1 to a space of functions that have Holder regularity a for every a < p — 1/2. 

It turns out that for this prior we can explicitly derive the correspond- 
ing posterior. It is a Gaussian measure on L 2 [0,1] again and we can ob- 
tain expressions for posterior mean and precision operator. These expres- 
sions involve the periodic local time of the diffusion. This is the random field 
L° = [L?p{x) : T > 0,i £ [0, 1]) with the property that for every 1-periodic, 
nonnegative measurable function / and T > 0, 

f(X t )dt= f f(x)L° T (x)dx. 
Jo 

Moreover, we need the random held x° defined by 

\iX Q <X T , 
ifX T <X , 
else. 



eZ:X <x + k< X T } 
XtO) = { eZ:X T <x + k<X a } 
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A non-rigorous derivation of the posterior now proceeds by first noting that 
by Ito's formula and periodicity, the likelihood (3.2) can be written as 

p(X | b) = exp ( - (L° T (x) (b 2 (x) + b'(x)) - 2 X ° T (x)b(x)) dx 

Next we observe that, very loosely speaking, the Gaussian prior II has a "den- 
sity" proportional to 

b H> exp ^ — b(x)A~ 1 b(x) dx 

But then the posterior has a "density" that is proportional to the product of 
these two quantities. Using also integration by parts we see that this equals 



exp 



( " \j Q b ^ A ~ l + L° T {x))b{x) dx + J b(x)Q-{L° T (x))> + xHxj) dx) . 



This is, up to a constant, again the "density" of a Gaussian measure. By 
completing the square we see that its mean bx and covariance operator At 
satisfy A^ 1 = A -1 + L° T 1 and A^bx — \{L T )' + %t- Recalling the definition 
of A we obtain the relations 

Ay 1 = 77 (-Af + (r,5 + L° T ) I, 
77 (-Af b T + (vS + L° T )b T = 1 -{L T )> + X ° T . 

To make the derivation of the posterior rigorous the above differential equa- 
tion for the posterior mean has to be understood in an appropriate weak sense, 
since the ordinary derivative of local time does not exist. As detailed in Pokern 
et al. (2012) this is indeed possible and it can be shown that the differential 
equation for bx has a unique weak solution. Well-established methods from 
numerical analysis can be then used to compute the posterior. For further de- 
tails and application of the approach in problems from molecular dynamics and 
finance, we refer to Papaspiliopoulos et al. (2012). 

The approach of Papaspiliopoulos et al. (2012) can be compared to the naive 
approach outlined in the preceding subsection by noting that their prior II can 
in fact equivalcntly be described by a series expansion. Define the basisfunctions 

■02fc(a^) = \/2 cos(2irkx), 
ip2k~i(x) = v2 sin(27rfea;) , 

for k E N. It is easily verified that the functions ipk are the eigenfunctions of 
the prior precision operator A -1 , and the corresponding eigenvalues are given 

by 

K 1 = '/ 

It follows that the prior on b can also be defined structurally by expanding 
b = Y^k=i c k4>k and putting independent Gaussian priors on the coefficients c^, 
with mean and variance Afe. 
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So compared to the naive finite series approach, the proposal of Pa- 
paspiliopoulos et al. (2012) is genuinely nonparametric and eliminates the chance 
of missspecifying the form of the drift. On the down side, the prior has fixed 
hyper parameters that still might combine poorly with the true drift. In par- 
ticular, there remains a possibility that a multiplicative scaling parameter rj is 
chosen that incorrectly reflects the scale of the actual drift. This can deterio- 
rate the quality of the inference. In the next subsection we discuss a prior which 
allows for a data-driven choice of the scaling. 

4.3 Infinite series priors 

In this section we consider the approach proposed by Schauer et al. (2012). We 
consider the same setup as before, i.e. the drift b is assumed to be 1-periodic. 

The prior of Schauer et al. (2012) is defined by writing a truncated series 
expansion for b and putting prior weights on the truncation point and on the 
coefficients in the expansion. We employ general 1-periodic, continuous basis 
functions ipk, k £ N. Next we fix an increasing sequence of natural numbers 
rrij, j £ N, to group the basis functions into levels. The functions ip\,..., ip mi 
constitute level 1, the functions ip mi +i, ■ ■ ■ j^ma correspond to level 2, etcetera. 
In this manner we can accommodate both families of basis functions with a single 
index (e.g. the Fourier basis) and doubly indexed families (e.g. wavelet-type 
bases). Functions that are linear combinations of the first rrij basis functions 
«/»!,..., Tp mj are said to belong to model j. Model j encompasses levels 1 up till 
3- 

To define the prior on b we first put a prior on the model index j, given by 
certain prior weights p(j), j € N. By construction, a function in model j can be 
expanded as Yli=i ®l V 7 ; f° r a certain vector of coefficients 9^ £ M mj . Given j, we 
endow this vector with a prior by postulating that the coefficients Q\ are given 
by an inverse gamma scaling constant times independent, centered Gaussians 
with certain decreasing variances. 

Concretely, to define the prior we fix model probabilities p(j), j £ N, 
decreasing variances £f, I £ N, positive constants a, b > and set Hj = 
diag(£i , . . . , Then the hierarchical prior IT on the drift function b is defined 

as follows: 

j ~Ki)> 

s 2 ~IG(o,6), 

rrij 

i=i 

In the paper Schauer et al. (2012) particular choices for the basis functions, the 
prior on j and the variances £z are considered in more detail. 

This prior proposed by Schauer et al. (2012) a different from the one con- 
sidered above in a number of ways. Firstly, different basis functions can be 
used. This added flexibility can be computationally attractive. The posterior 
computations involve the inversion of certain large matrices and choosing basis 
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functions with local support typically makes these matrices sparse. A second 
difference is that we truncate the infinite series at a level that we endow with a 
prior as well. In this manner we can achieve considerable computational gains if 
the data driven truncation point is relatively small, so that only low-dimensional 
models are used and hence only relatively small matrices have to be inverted. A 
last important change is that we do not set the multiplicative hyper parameter 
at a fixed value, but instead endow it with a prior and let the data determine 
the appropriate value. 

The simulations presented in Schauer et al. (2012) indicate that this ap- 
proach has several advantages. Although the truncation of the series at a data 
driven point involves incorporating reversible jump MCMC steps in the compu- 
tational algorithm, it can indeed lead to a considerably faster procedure com- 
pared to truncating at some fixed high level. Moreover, the introduction of a 
prior on the multiplicative hyper parameter reduces the risk of misspecifying 
the scale of the drift. Using a fixed scaling parameter can seriously deteriorate 
the quality of the inference, whereas the hierarchical procedure with a prior on 
that parameter is able to adapt to the true scale of the drift. Also, numerical 
investigations indicate that by the introduction of a prior on the scale we also 
can achieve some degree of adaptation to smoothness. 

The prior II described above is constructed in such a way that numerical 
computation is practically feasible. Within a fixed model j, a Gibbs sampler for 
sampling Qi and s 2 can be constructed using standard inverse gamma-normal 
computations. Reversible jump MCMC can be used to jump between different 
models. This involves the computation of certain Bayes factors, for which a 
closed form expression can be derived in this setup. If only low-frequency data 
are available, the Gibbs and reversible MCMC steps can be combined with data 
augmentation steps involving the simulation of diffusion bridges, as outlined 
also in the preceding subsection. For more details about computational issues 
and simulation examples we refer to Schauer et al. (2012). 

5 Asymptotics 

The negative examples of e.g. Frcedman (1963, 1965) or Diaconis and Frccdman 
( 1986a, b) show that in Bayesian nonparametrics, even intuitively reasonable 
priors may lead to inconsistent procedures. More generally, it is by now well 
known that contrary to the parametric setting, the choice of the prior has a large 
impact on the performance in infinite-dimensional models. This performance is 
determined by fine mathematical properties and can not be assessed by simply 
eyeballing the prior. As a result there is an interest in mathematical results 
that relate properties of the prior to the quality of the Bayes procedure. Such 
results can serve as guidelines for the selection or construction of priors. 

Mathematical results in this setting typically assume that the data are gen- 
erated using a true drift function bo and study if and how the posterior concen- 
trates around bo as more and more data become available. In the continuous- 
time case in which we observe the diffusion on a time interval [0, T] this means 
we let T —¥ oo. In the low- frequency case it simply means we let the number 
n of observations tend to infinity. Posterior consistency is the property that 
the posterior indeed contracts around bo, in the sense that asymptotically, any 
neighborhood (relative to a suitably define topology) of bo receives posterior 



13 



mass 1. This is a property that any reasonable procedure should ideally have. 
Once posterior consistency has been established, the rate at which the posterior 
contracts around the true 60 can be studied. In particular it can be investigated 
whether a certain prior leads to optimal convergence rates. 

For diffusion models, the paper Van der Meulen et al. (2006) was the first 
to systematically study convergence rates for nonparametric Bayes procedures. 
General conditions were derived for attaining a certain rate of contraction, in 
terms of the metric entropy of the support of the prior and the mass that the 
prior assigns to neighborhoods of the true function. These conditions are the 
analogues of similar conditions that were initially derived for the setting of i.i.d. 
density estimation by Ghosal et al. (2000). A concrete prior for ergodic diffusions 
considered in Van der Meulen et al. (2006) is a Dirichlet process like prior 
designed to model decreasing drift functions. This prior is shown to attain the 
optimal convergence rate T -1 / 3 (up to a logarithmic factor). Certain Gaussian 
process priors for the drift, essentially multiply integrated Brownian motions, 
are considered in the paper Panzar and Van Zantcn (2009). These priors have a 
certain fixed degree of regularity. Brownian motion has smoothness of order 1/2, 
integrated Brownian motion has smoothness level 3/2, etc. For such Gaussian 
priors the message is that if the regularity of the prior that is used coincides with 
the regularity of the unknown drift function, then optimal contraction rates are 
achieved. Specifically, if both the drift and the prior have regularity j3 > 0, 
then the posterior contracts around the true drift at the rate T~ ''' '( 1+2 P\ which 
can be shown to be optimal in a certain sense, cf. e.g. Kutoyants (2004). If 
the two regularities are not equal however, only sub-optimal speeds are realized 
in general. This is in line with general findings for Gaussian priors in other 
settings, see e.g. Van der Vaart and Van Zanten (2008), Castillo (2008). 

The concrete Gaussian prior with precision operator (4.1) described in Sec- 
tion 4.2 is analyzed in detail in the paper Pokern et al. (2012). As mentioned 
above, this prior has regularity /3 =p — 1/2. It is proved in Pokern et al. (2012) 
that the corresponding posterior contracts around the true drift at the rate 
j , -(2p-i)/(4j>) ) p r0 vided the drift has regularity p. Note that this rate is also the 
optimal T~P'( 1+2 P> , but the assumption on the drift is that it is /3 + 1/2-regular. 
It is expected however that also in the periodic setting of Pokern et al. (2012) 
this assumption on the drift can be weakened to /3-regularity, and that just as 
in the ergodic setting studied in Panzar and Van Zanten (2009), it holds that a 
Gaussian prior with fixed regularity is rate-optimal if and only if its smoothness 
matches the smoothness of the true drift. 

Priors that perform optimally across a whole range of regularities for the 
drift, i.e. so-called rate-adaptive priors have not yet been exhibited for diffusion 
models. It is expected that the prior of Schaucr et al. (2012) described in Section 
4.3 allows for a degree of adaptation to smoothness, but this has not yet been 
established. A combination of the general theory of Van der Meulen et al. (2006) 
and new local time asymptotics obtained in Pokern et al. (2012) are expected 
to shed further light on the matter in the near future. 

The asymptotic results mentioned thus far all concern continuously observed 
diffusions, where the accumulation of information is ensured either by ergodic- 
ity or by periodicity assumptions. The derivation of usable results for the low- 
frequency setting is much more involved. The fact that the discrete-time likeli- 
hood can not be explicitly expressed in terms of the drift complicates the analysis 
considerably. At the present time, the only available results deal with posterior 
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consistency relative to a weak topology, cf. Van der Meulen and Van Zanten 
(2012) and the extensions in Gugushvili and Spreij (2012). It is a great chal- 
lenge to obtain consistency results for stronger topologies and rate of contraction 
results for procedures based on low-frequency data. 

6 Concluding remarks 

Nonparametric Bayesian methodology for stochastic differential equations has 
started to develop only very recently. At this point in time there exist only 
a few methods that are computationally feasible. Moreover, the theoretical 
performance analysis of these methods is still rather immature. Nevertheless, 
the nonparametric Bayes approach can be expected to become more and more 
common in the near future, since it combines the advantages of flexible, non- 
parametric modeling with the possibility of providing uncertainty quantification 
and the possibility to deal with low-frequency data. This development will be 
stimulated by ongoing work on computational matters and theoretical founda- 
tions. 
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